Case Study 3: Flagging Spam Emails

Team: Ana Glaser, Jake Harrison, Rob Burigo, and Yvan Sojdehei


Business Understanding

The company in question has workers receiving vast amounts of email every day. The objective of this case study is to classify whether an email is spam or work-related, streamlining inboxes to include only the important emails.


Modeling Preparations

Methods:

The methods our team applied to solve this problem are Naive Bayes and K-Means clustering. In preparation for modeling, the team parsed each segment of every email file into a structured form. Various natural language processing cleansing techniques were then applied to support the exploratory data analysis. Because the input data are raw email files, this approach let us feature-engineer the enriched data available within the file structure.

Evaluation Metrics:

The metrics our team utilized for this project are the F1 score and the confusion matrix for the Naive Bayes model.

Initially we intended to use accuracy and precision, but due to the imbalanced nature of the data we pivoted to the F1 score.

The main reason our team chose these metrics is that they evaluate model performance better under the disproportionate class distribution that exists within our dataset. F1 is the harmonic mean of precision and recall, which represents a more holistic view of the success of our Naive Bayes model.
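As a quick reference, F1 combines the two components as 2 × (precision × recall) / (precision + recall). A minimal sketch with made-up numbers (not from our results):

# Hypothetical precision/recall values, for illustration only
precision = 0.60   # 60% of emails flagged as spam are truly spam
recall = 0.40      # 40% of all spam emails get flagged
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.48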

Importing Packages

In [1]:
import os
import re
import email
import warnings
from collections import Counter
from email.parser import BytesParser, Parser
from email.policy import default
from html.parser import HTMLParser

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import statsmodels.api as sm
from bs4 import BeautifulSoup
from yellowbrick.model_selection import FeatureImportances

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize

from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

warnings.filterwarnings('ignore')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jake\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jake\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jake\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Jake\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

Defining Functions to use within the Feature Engineering section.

In [2]:
def strip_html(text):
    """Strip HTML tags and return only the visible text."""
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()


def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stop_list:
            new_words.append(word)
    return new_words

def lemmatize_list(words):
    """Lemmatize each word as a verb (e.g. 'flagged' -> 'flag')."""
    lemmatizer = WordNetLemmatizer()
    new_words = []
    for word in words:
        new_words.append(lemmatizer.lemmatize(word, pos='v'))
    return new_words

def normalize(words):
    """Run the full cleansing pipeline on a list of tokens, returning one string."""
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    words = lemmatize_list(words)
    return ' '.join(words)

stop_words = stopwords.words('english')

# Negation words to keep, since negations can carry meaning
customlist = ['not', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
        "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',
        "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
        "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# Final stop list: the standard English stop words minus the negations above
stop_list = list(set(stop_words) - set(customlist))
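For illustration (not part of the original pipeline), the full cleansing pipeline behaves roughly like this:

sample = nltk.word_tokenize("The emails WERE Flagged as spam!")
print(normalize(sample))  # roughly: 'email flag spam'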

Data Evaluation / Engineering

Importing data. We consolidated all email files into two directories, spam and easy_ham (non-spam), for ease of processing.

In [3]:
spam_list = os.listdir("./spam/")
ham_list = os.listdir("./easy_ham/")

The first iteration cleans and combines the non-spam email file headers with their content.

In [4]:
folder=("./easy_ham")
files=os.listdir(folder)
emails=[folder+'/'+file for file in files]
In [5]:
my_dict_text = {"Words": [], "Type": []}

for email_path in emails:  # renamed from `email` to avoid shadowing the imported module
    with open(email_path, encoding='latin-1') as f:
        blob = f.read()
    my_dict_text["Words"].append(blob)
    my_dict_text['Type'].append(0)  # 0 = ham (not spam)
dataStackText = pd.DataFrame(my_dict_text)
In [6]:
my_dict = {"To": [], "From": [], "Subject": []}

for i in ham_list:
    with open("./easy_ham/"+i, 'rb') as fp:
        try:
            headers = BytesParser(policy=default).parse(fp)
            to_text = '{}'.format(headers['to'])
            from_text = '{}'.format(headers['from'])
            subject = '{}'.format(headers['subject'])    
            my_dict["To"].append(to_text)
            my_dict["From"].append(from_text)
            my_dict["Subject"].append(subject)
        except Exception:
            continue

dataStack = pd.DataFrame(my_dict)
In [7]:
EMAIL_DF = pd.merge(dataStack, dataStackText, how='left', left_index=True, right_index=True)  # positional join; assumes both passes saw the files in the same order
In [8]:
COMPLETE_EMAIL_DF = EMAIL_DF
COMPLETE_EMAIL_DF.shape
Out[8]:
(6940, 5)

The second iteration cleans and combines the spam email file headers with their content.

In [9]:
folder=("./spam")
files=os.listdir(folder)
emails=[folder+'/'+file for file in files]
In [10]:
my_dict_text = {"Words": [], "Type": []}

for email_path in emails:
    with open(email_path, encoding='latin-1') as f:
        blob = f.read()
    my_dict_text["Words"].append(blob)
    my_dict_text['Type'].append(1)  # 1 = spam
dataStackText = pd.DataFrame(my_dict_text)
In [11]:
my_dict = {"To": [], "From": [], "Subject": []}

for i in spam_list:
    with open("./spam/"+i, 'rb') as fp:
        try:
            headers = BytesParser(policy=default).parse(fp)
            #print(headers)
            to_text = '{}'.format(headers['to'])
            from_text = '{}'.format(headers['from'])
            subject = '{}'.format(headers['subject'])
            my_dict["To"].append(to_text)
            my_dict["From"].append(from_text)
            my_dict["Subject"].append(subject)
        except Exception:
            continue

dataStack = pd.DataFrame(my_dict)

We stacked the non-spam and spam datasets together to create one dataframe to analyze.

In [12]:
EMAIL_DF = pd.merge(dataStack, dataStackText, how='left', left_index=True, right_index=True)
In [13]:
COMPLETE_EMAIL_DF = COMPLETE_EMAIL_DF.append(EMAIL_DF)  # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
COMPLETE_EMAIL_DF.shape
Out[13]:
(9338, 5)
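The two passes above repeat the same parsing logic; a single helper along these lines (a sketch, not part of the original notebook) would avoid the duplication:

def load_emails(folder, label):
    """Parse headers and raw text for every email file in `folder`."""
    rows = {"To": [], "From": [], "Subject": [], "Words": [], "Type": []}
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        with open(path, 'rb') as fp:
            headers = BytesParser(policy=default).parse(fp)
        with open(path, encoding='latin-1') as f:
            rows["Words"].append(f.read())
        rows["To"].append('{}'.format(headers['to']))
        rows["From"].append('{}'.format(headers['from']))
        rows["Subject"].append('{}'.format(headers['subject']))
        rows["Type"].append(label)
    return pd.DataFrame(rows)

# roughly equivalent to the merge-and-append steps above:
# COMPLETE_EMAIL_DF = pd.concat([load_emails("./easy_ham", 0), load_emails("./spam", 1)])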
In [14]:
COMPLETE_EMAIL_DF.head()
Out[14]:
To From Subject Words Type
0 Chris Garrigues <cwg-dated-1030314468.7c7c85@D... Robert Elz <kre@munnari.OZ.AU> Re: New Sequences Window Return-Path: <exmh-workers-admin@spamassassin.... 0
1 mkettler@home.com The Motley Fool <Fool@motleyfool.com> Personal Finance: Resolutions You Can Keep Return-Path: Fool@motleyfool.com\nDelivery-Dat... 0
2 Valdis.Kletnieks@vt.edu Chris Garrigues <cwg-exmh@DeepEddy.Com> Re: New Sequences Window From exmh-workers-admin@redhat.com Wed Aug 21... 0
3 rod-3ds@arsecandle.org malcolm-sweeps@mrichi.com Malcolm in the Middle Sweepstakes Prize Notifi... Return-Path: <malcolm-sweeps@mrichi.com>\nDeli... 0
4 Robert Elz <kre@munnari.OZ.AU> Chris Garrigues <cwg-exmh@DeepEddy.Com> Re: New Sequences Window From exmh-workers-admin@redhat.com Wed Aug 21... 0

We normalized the data for analysis using standard natural language processing cleansing techniques: stripping HTML, removing punctuation and stop words, and lowercasing all text.

In [15]:
# Removing any excess html
COMPLETE_EMAIL_DF['Words'] = COMPLETE_EMAIL_DF['Words'].apply(lambda x: strip_html(x))
COMPLETE_EMAIL_DF['Subject'] = COMPLETE_EMAIL_DF['Subject'].apply(lambda x: strip_html(x))
In [16]:
# Tokenizing each column
COMPLETE_EMAIL_DF['Words_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: nltk.word_tokenize(row['Words']), axis=1)
COMPLETE_EMAIL_DF['Subject_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: nltk.word_tokenize(row['Subject']), axis=1)
COMPLETE_EMAIL_DF['To_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: nltk.word_tokenize(row['To']), axis=1)
COMPLETE_EMAIL_DF['From_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: nltk.word_tokenize(row['From']), axis=1)
In [17]:
# normalizing each column

COMPLETE_EMAIL_DF['Subject_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: normalize(row['Subject_norm']), axis=1)
COMPLETE_EMAIL_DF['Words_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: normalize(row['Words_norm']), axis=1)
COMPLETE_EMAIL_DF['To_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: normalize(row['To_norm']), axis=1)
COMPLETE_EMAIL_DF['From_norm'] = COMPLETE_EMAIL_DF.apply(lambda row: normalize(row['From_norm']), axis=1)
In [18]:
COMPLETE_EMAIL_DF.head()
Out[18]:
To From Subject Words Type Words_norm Subject_norm To_norm From_norm
0 Chris Garrigues <cwg-dated-1030314468.7c7c85@D... Robert Elz <kre@munnari.OZ.AU> Re: New Sequences Window Return-Path: \nDelivered-To: yyyy@localhost.ne... 0 returnpath deliveredto yyyy localhostnetnotein... new sequence window chris garrigues cwgdated10303144687c7c85 deepe... robert elz kre munnariozau
1 mkettler@home.com The Motley Fool <Fool@motleyfool.com> Personal Finance: Resolutions You Can Keep Return-Path: Fool@motleyfool.com\nDelivery-Dat... 0 returnpath fool motleyfoolcom deliverydate wed... personal finance resolutions keep mkettler homecom motley fool fool motleyfoolcom
2 Valdis.Kletnieks@vt.edu Chris Garrigues <cwg-exmh@DeepEddy.Com> Re: New Sequences Window From exmh-workers-admin@redhat.com Wed Aug 21... 0 exmhworkersadmin redhatcom wed aug 21 161835 2... new sequence window valdiskletnieks vtedu chris garrigues cwgexmh deepeddycom
3 rod-3ds@arsecandle.org malcolm-sweeps@mrichi.com Malcolm in the Middle Sweepstakes Prize Notifi... Return-Path: \nDelivered-To: rod@arsecandle.or... 0 returnpath deliveredto rod arsecandleorg recei... malcolm middle sweepstakes prize notification rod3ds arsecandleorg malcolmsweeps mrichicom
4 Robert Elz <kre@munnari.OZ.AU> Chris Garrigues <cwg-exmh@DeepEddy.Com> Re: New Sequences Window From exmh-workers-admin@redhat.com Wed Aug 21... 0 exmhworkersadmin redhatcom wed aug 21 161836 2... new sequence window robert elz kre munnariozau chris garrigues cwgexmh deepeddycom

Created a list of the most common spam words to leverage in the feature engineering functions.

In [19]:
folder=("./spam")
files=os.listdir(folder)
emails=[folder+'/'+file for file in files]
In [20]:
words = []
for email_path in emails:
    with open(email_path, encoding='latin-1') as f:
        blob = f.read()
    words += blob.split(" ")
words = to_lowercase(words)
words = remove_stopwords(words)
In [21]:
# Blank out non-alphabetic tokens, then drop them from the counts
for i in range(len(words)):
    if not words[i].isalpha():
        words[i] = ""
word_dict = Counter(words)
del word_dict[""]
In [22]:
# Keep only the 1,000 most frequent spam words
word_dict = word_dict.most_common(1000)
word_dict = [k for k, v in word_dict]

Now we have a dataframe containing both the original and the normalized email content.

In [23]:
COMPLETE_EMAIL_DF.head()
Out[23]:
To From Subject Words Type Words_norm Subject_norm To_norm From_norm
0 Chris Garrigues <cwg-dated-1030314468.7c7c85@D... Robert Elz <kre@munnari.OZ.AU> Re: New Sequences Window Return-Path: \nDelivered-To: yyyy@localhost.ne... 0 returnpath deliveredto yyyy localhostnetnotein... new sequence window chris garrigues cwgdated10303144687c7c85 deepe... robert elz kre munnariozau
1 mkettler@home.com The Motley Fool <Fool@motleyfool.com> Personal Finance: Resolutions You Can Keep Return-Path: Fool@motleyfool.com\nDelivery-Dat... 0 returnpath fool motleyfoolcom deliverydate wed... personal finance resolutions keep mkettler homecom motley fool fool motleyfoolcom
2 Valdis.Kletnieks@vt.edu Chris Garrigues <cwg-exmh@DeepEddy.Com> Re: New Sequences Window From exmh-workers-admin@redhat.com Wed Aug 21... 0 exmhworkersadmin redhatcom wed aug 21 161835 2... new sequence window valdiskletnieks vtedu chris garrigues cwgexmh deepeddycom
3 rod-3ds@arsecandle.org malcolm-sweeps@mrichi.com Malcolm in the Middle Sweepstakes Prize Notifi... Return-Path: \nDelivered-To: rod@arsecandle.or... 0 returnpath deliveredto rod arsecandleorg recei... malcolm middle sweepstakes prize notification rod3ds arsecandleorg malcolmsweeps mrichicom
4 Robert Elz <kre@munnari.OZ.AU> Chris Garrigues <cwg-exmh@DeepEddy.Com> Re: New Sequences Window From exmh-workers-admin@redhat.com Wed Aug 21... 0 exmhworkersadmin redhatcom wed aug 21 161836 2... new sequence window robert elz kre munnariozau chris garrigues cwgexmh deepeddycom

Feature Engineering:

As part of the analysis, we appended a series of features based on the frequency of common spam keywords.

Counting how many of the most common spam words appear in each email's subject and content.

In [24]:
spam_list_words = set(word_dict)  # set membership checks are O(1) vs O(n) for a list
COMPLETE_EMAIL_DF['spam_count_content'] = COMPLETE_EMAIL_DF['Words_norm'].apply(lambda x: sum(i in spam_list_words for i in x.split()))
COMPLETE_EMAIL_DF['spam_count_subject'] = COMPLETE_EMAIL_DF['Subject_norm'].apply(lambda x: sum(i in spam_list_words for i in x.split()))

Getting a capital letter count for each email's from address, subject, and content.

In [25]:
COMPLETE_EMAIL_DF['subject_cl_count'] = COMPLETE_EMAIL_DF['Subject'].apply(lambda x: sum(1 for c in x if c.isupper()))
COMPLETE_EMAIL_DF['from_cl_count'] = COMPLETE_EMAIL_DF['From'].apply(lambda x: sum(1 for c in x if c.isupper()))
COMPLETE_EMAIL_DF['content_cl_count'] = COMPLETE_EMAIL_DF['Words'].apply(lambda x: sum(1 for c in x if c.isupper()))

Getting the character count and digit count of the from address, as these could flag possible spam.

In [26]:
COMPLETE_EMAIL_DF['from_ch_count'] = COMPLETE_EMAIL_DF['From'].apply(len)  # every character counts, so this is just the length
COMPLETE_EMAIL_DF['from_int_count'] = COMPLETE_EMAIL_DF['From'].apply(lambda x: sum(1 for c in x if c.isdigit()))

Creating a boolean field for whether the from address ends in com, edu, or us.

In [27]:
COMPLETE_EMAIL_DF['from_dotcom'] = COMPLETE_EMAIL_DF['From_norm'].apply(lambda x: x.endswith("com"))
COMPLETE_EMAIL_DF['from_dotedu'] = COMPLETE_EMAIL_DF['From_norm'].apply(lambda x: x.endswith("edu"))
COMPLETE_EMAIL_DF['from_dotus'] = COMPLETE_EMAIL_DF['From_norm'].apply(lambda x: x.endswith("us"))
In [28]:
COMPLETE_EMAIL_DF['is_spam'] = COMPLETE_EMAIL_DF['Type']

Dropping the raw columns to get the final dataset for our EDA.

In [29]:
data_final = COMPLETE_EMAIL_DF.drop(['To','From','Subject','Type'], axis=1)
In [30]:
data_final
Out[30]:
Words Words_norm Subject_norm To_norm From_norm spam_count_content spam_count_subject subject_cl_count from_cl_count content_cl_count from_ch_count from_int_count from_dotcom from_dotedu from_dotus is_spam
0 Return-Path: \nDelivered-To: yyyy@localhost.ne... returnpath deliveredto yyyy localhostnetnotein... new sequence window chris garrigues cwgdated10303144687c7c85 deepe... robert elz kre munnariozau 302 1 4 6 382 30 0 False False False 0
1 Return-Path: Fool@motleyfool.com\nDelivery-Dat... returnpath fool motleyfoolcom deliverydate wed... personal finance resolutions keep mkettler homecom motley fool fool motleyfoolcom 317 2 6 4 718 37 0 True False False 0
2 From exmh-workers-admin@redhat.com Wed Aug 21... exmhworkersadmin redhatcom wed aug 21 161835 2... new sequence window valdiskletnieks vtedu chris garrigues cwgexmh deepeddycom 117 1 4 5 409 39 0 True False False 0
3 Return-Path: \nDelivered-To: rod@arsecandle.or... returnpath deliveredto rod arsecandleorg recei... malcolm middle sweepstakes prize notification rod3ds arsecandleorg malcolmsweeps mrichicom 535 0 5 0 1520 25 0 True False False 0
4 From exmh-workers-admin@redhat.com Wed Aug 21... exmhworkersadmin redhatcom wed aug 21 161836 2... new sequence window robert elz kre munnariozau chris garrigues cwgexmh deepeddycom 134 1 4 5 423 39 0 True False False 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2393 From cna@insiq.us Tue Oct 8 00:10:39 2002\nR... cna insiqus tue oct 8 001039 2002 returnpath d... hit road cna zzzz jmasonorg iq cna cna insiqus 316 2 5 5 963 23 0 False False True 1
2394 From bounce2@u-answer.com Tue Oct 8 11:02:30... bounce2 uanswercom tue oct 8 110230 2002 retur... 10 hour watch emmercials joke undisclosedrecipients answerus davicomcokr 97 2 1 2 137 23 0 False False False 1
2395 From beautyinfufuxxxmeb13mxy@aol.com Tue Oct ... beautyinfufuxxxmeb13mxy aolcom tue oct 8 11023... make fortune ebay 24772 mike dogmaslashnullorg beautyinfufuxxxmeb13mxy aolcom 107 3 4 0 134 31 2 True False False 1
2396 From evtwqmigru@datcon.co.uk Tue Oct 8 11:02... evtwqmigru datconcouk tue oct 8 110237 2002 re... faeries wciml chezcom time evtwqmigru datconcouk 890 0 1 2 1700 36 0 False False False 1
2397 mv 00001.7848dde101aa985090474a91ec93fcf0 0000... mv 000017848dde101aa985090474a91ec93fcf0 00001... none none none 0 0 1 1 0 4 0 False False False 1

9338 rows × 16 columns

Exploratory Data Analysis

After creating the final dataset, we verified there were no missing values.

In [31]:
data_final.isnull().sum()
Out[31]:
Words                 0
Words_norm            0
Subject_norm          0
To_norm               0
From_norm             0
spam_count_content    0
spam_count_subject    0
subject_cl_count      0
from_cl_count         0
content_cl_count      0
from_ch_count         0
from_int_count        0
from_dotcom           0
from_dotedu           0
from_dotus            0
is_spam               0
dtype: int64

Displaying the summary of our dataset. Several count attributes have maximum values far above their means and medians, indicating a heavily right-skewed dataset.

In [32]:
data_final.describe()
Out[32]:
spam_count_content spam_count_subject subject_cl_count from_cl_count content_cl_count from_ch_count from_int_count is_spam
count 9338.000000 9338.000000 9338.000000 9338.000000 9338.000000 9338.000000 9338.000000 9338.000000
mean 120.401906 1.439602 4.982652 2.017027 389.711501 34.314093 0.845256 0.256800
std 165.282288 1.414154 5.453663 2.834296 2548.711998 12.149288 2.491055 0.436892
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 63.000000 0.000000 2.000000 0.000000 148.000000 27.000000 0.000000 0.000000
50% 87.000000 1.000000 4.000000 2.000000 208.000000 33.000000 0.000000 0.000000
75% 123.000000 2.000000 6.000000 2.000000 295.000000 40.000000 0.000000 1.000000
max 3748.000000 28.000000 81.000000 32.000000 116506.000000 144.000000 45.000000 1.000000

Histograms of our continuous variables let us visually observe the skewness within the dataset.

In [33]:
columns = list(data_final.select_dtypes('int64'))
data_final[columns].hist(stacked=False, bins=100, figsize=(15,15), layout=(5,3)); 
In [34]:
data_final.skew()
Out[34]:
spam_count_content     9.379257
spam_count_subject     1.747891
subject_cl_count       3.572663
from_cl_count          3.644891
content_cl_count      31.577371
from_ch_count          1.582716
from_int_count         4.158932
from_dotcom           -0.006855
from_dotedu            5.635035
from_dotus            11.680077
is_spam                1.113557
dtype: float64

Looking at the mean of both spam-count columns, we can see that each mean is higher for spam emails (1) than for ham emails (0).

In [35]:
data_final.groupby(['is_spam'])['spam_count_content'].mean()
Out[35]:
is_spam
0    111.692507
1    145.607590
Name: spam_count_content, dtype: float64
In [36]:
data_final.groupby(['is_spam'])['spam_count_subject'].mean()
Out[36]:
is_spam
0    1.168300
1    2.224771
Name: spam_count_subject, dtype: float64

We identified an outlier within the spam word count of the subject line which needs to be addressed before moving forward. Since this affects only one record, we will remove it rather than capping it at the boxplot's top whisker (Q3 + 1.5 × IQR, not to be confused with the 75th percentile, which is the top of the box).

In [37]:
sns.boxplot(x="is_spam", y="spam_count_subject", data=data_final)
Out[37]:
<AxesSubplot:xlabel='is_spam', ylabel='spam_count_subject'>

After removing the outliers (subject spam-word counts of 20 or more and content counts of 300 or more, in the following cells), it was easier to observe the magnitude of the difference between the spam and non-spam profiles: the spam-word count distributions for both subject and content sit higher for spam than for non-spam emails.

In [38]:
data_final = data_final[data_final['spam_count_subject'] < 20]
sns.boxplot(x="is_spam", y="spam_count_subject", data=data_final)
Out[38]:
<AxesSubplot:xlabel='is_spam', ylabel='spam_count_subject'>
In [39]:
data_final = data_final[data_final['spam_count_content'] < 300]
sns.boxplot(x="is_spam", y="spam_count_content", data=data_final)
Out[39]:
<AxesSubplot:xlabel='is_spam', ylabel='spam_count_content'>

After eliminating the outlier observations from the dataset, the skewness was reduced significantly for those attributes.

In [40]:
data_final.skew()
Out[40]:
spam_count_content     1.288255
spam_count_subject     1.104251
subject_cl_count       3.611340
from_cl_count          3.616751
content_cl_count      31.300991
from_ch_count          1.565681
from_int_count         4.187774
from_dotcom            0.015974
from_dotedu            5.773160
from_dotus            12.483059
is_spam                1.145743
dtype: float64

The structure of our model entails aggregating spam-word frequencies and per-segment word counts, such as words in the subject line and words in the email body, to determine whether an email is spam.

In [41]:
model_data = data_final.select_dtypes(exclude = 'object')
In [42]:
model_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8891 entries, 2 to 2397
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   spam_count_content  8891 non-null   int64
 1   spam_count_subject  8891 non-null   int64
 2   subject_cl_count    8891 non-null   int64
 3   from_cl_count       8891 non-null   int64
 4   content_cl_count    8891 non-null   int64
 5   from_ch_count       8891 non-null   int64
 6   from_int_count      8891 non-null   int64
 7   from_dotcom         8891 non-null   bool 
 8   from_dotedu         8891 non-null   bool 
 9   from_dotus          8891 non-null   bool 
 10  is_spam             8891 non-null   int64
dtypes: bool(3), int64(8)
memory usage: 971.2 KB

There is distinct separation between the attributes of spam and non-spam emails. Our model will leverage these engineered features to execute the classification task.

In [43]:
fig = px.scatter_3d(model_data, x='spam_count_subject', y='from_ch_count', z='from_cl_count', color='is_spam', title="Separation of Spam")
fig.update_layout(width = 550, height = 550,margin=dict(l=0, r=0, b=0, t=0))
fig.show()

Model Building & Evaluation

K-Means Clustering:

K-Means clustering takes a user-specified number of clusters, k, and iteratively assigns each observation to the cluster whose mean (centroid) is nearest, recomputing the centroids until the assignments stabilize.
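Conceptually, each iteration alternates an assignment step and an update step; a minimal numpy sketch (independent of the scikit-learn call used below, and assuming no cluster goes empty):

def kmeans_step(X, centroids):
    """One Lloyd iteration: assign points to the nearest centroid, then recompute means."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n_points, k) distances
    labels = dists.argmin(axis=1)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    return labels, new_centroids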

In [44]:
x = model_data.iloc[:]  # .iloc selects by position, rows first then columns; here we keep everything (note this still includes is_spam)
x
Out[44]:
spam_count_content spam_count_subject subject_cl_count from_cl_count content_cl_count from_ch_count from_int_count from_dotcom from_dotedu from_dotus is_spam
2 117 1 4 5 409 39 0 True False False 0
4 134 1 4 5 423 39 0 True False False 0
5 35 1 1 4 51 33 0 True False False 0
8 92 1 4 2 355 23 0 False True False 0
10 70 0 4 0 410 35 4 False False False 0
... ... ... ... ... ... ... ... ... ... ... ...
2391 190 3 4 1 340 29 9 True False False 1
2392 190 3 2 1 339 25 5 True False False 1
2394 97 2 1 2 137 23 0 False False False 1
2395 107 3 4 0 134 31 2 True False False 1
2397 0 0 1 1 0 4 0 False False False 1

8891 rows × 11 columns

Utilizing the elbow method, the optimal number of clusters was identified as 2. There does seem to be a second bend in the elbow at 4 clusters, but after evaluating the clusters we determined that 2 yielded better separation. We will utilize these cluster labels within our Naive Bayes model and assess model performance to determine their effectiveness.

In [45]:
sse = {}
# Fit KMeans and calculate SSE for each k
for k in range(1, 10):
    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)  
    # Fit KMeans on the (unscaled) dataset
    kmeans.fit(model_data)
    # Assign sum of squared distances to k element of dictionary
    sse[k] = kmeans.inertia_
# Plotting the elbow plot
plt.figure(figsize=(12,8))
plt.title('The Elbow Method')
plt.xlabel('k'); 
plt.ylabel('Sum of squared errors')
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()
In [46]:
kmeans = KMeans(n_clusters=2, algorithm = 'full')
kmeans.fit(x)
Out[46]:
KMeans(algorithm='full', n_clusters=2)

Cluster the existing data

In [47]:
identified_clusters = kmeans.fit_predict(x)
identified_clusters
Out[47]:
array([0, 0, 0, ..., 0, 0, 0])
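As a quick sanity check (a sketch; its output was not recorded in the notebook), the cluster sizes can be inspected with a bincount:

print(np.bincount(identified_clusters))  # number of emails assigned to each of the 2 clusters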

Merge the clusters with the main dataset

In [48]:
data_with_clusters = model_data.copy()
data_with_clusters['Clusters'] = identified_clusters 
In [49]:
data_with_clusters
Out[49]:
spam_count_content spam_count_subject subject_cl_count from_cl_count content_cl_count from_ch_count from_int_count from_dotcom from_dotedu from_dotus is_spam Clusters
2 117 1 4 5 409 39 0 True False False 0 0
4 134 1 4 5 423 39 0 True False False 0 0
5 35 1 1 4 51 33 0 True False False 0 0
8 92 1 4 2 355 23 0 False True False 0 0
10 70 0 4 0 410 35 4 False False False 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
2391 190 3 4 1 340 29 9 True False False 1 0
2392 190 3 2 1 339 25 5 True False False 1 0
2394 97 2 1 2 137 23 0 False False False 1 0
2395 107 3 4 0 134 31 2 True False False 1 0
2397 0 0 1 1 0 4 0 False False False 1 0

8891 rows × 12 columns

Naive Bayes:

Naive Bayes is a probabilistic classifier that applies Bayes' theorem, classifying each observation by the likelihood of its feature values occurring within each class, under the naive assumption that the features are conditionally independent.
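In a spam context, the model scores each class by its prior times the product of per-word likelihoods and picks the larger score. A toy sketch with hypothetical probabilities (not learned from our data):

# Hypothetical class priors (roughly our dataset's spam mix) and word likelihoods
p_spam, p_ham = 0.26, 0.74
p_word_spam = {'free': 0.05, 'meeting': 0.001}
p_word_ham = {'free': 0.005, 'meeting': 0.02}

email_words = ['free', 'meeting']
score_spam = p_spam * p_word_spam['free'] * p_word_spam['meeting']
score_ham = p_ham * p_word_ham['free'] * p_word_ham['meeting']
print('spam' if score_spam > score_ham else 'ham')  # 'ham' for these numbers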

Split the data into 75%/25% train and test sets.

In [50]:
X = data_with_clusters.drop('is_spam', axis=1)
y = data_with_clusters['is_spam']
X_train, X_test,y_train,y_test=train_test_split(X,y,test_size=0.25)
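Given the class imbalance noted earlier, a stratified split with a fixed seed (an alternative we did not use here) would keep the spam ratio consistent across the splits and make results reproducible:

# Hypothetical alternative: preserve the class ratio and fix the random seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)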

Based on our EDA, our features are non-negative counts rather than normally distributed values, so we chose to implement the Multinomial Naive Bayes model.

In [51]:
clf=MultinomialNB()
In [52]:
clf.fit(X_train, y_train)
Out[52]:
MultinomialNB()
In [53]:
y_pred=clf.predict(X_test)

The model yielded an F1 score of 0.216. Since our objective is to maximize F1, this low score tells us the model struggles to flag spam, which the confusion matrix below confirms.

In [54]:
print('F1 Score: %.3f' % f1_score(y_test, y_pred))
F1 Score: 0.216
In [55]:
predictions = pd.DataFrame(y_pred, columns = ['Predictions'])
In [56]:
X_test = X_test.reset_index()
results = pd.merge(X_test, predictions, left_index=True, right_index=True)
In [57]:
results.head(25)
Out[57]:
index spam_count_content spam_count_subject subject_cl_count from_cl_count content_cl_count from_ch_count from_int_count from_dotcom from_dotedu from_dotus Clusters Predictions
0 127 70 0 2 2 208 32 0 False False False 0 0
1 4476 39 3 5 0 86 41 0 False False False 0 0
2 6314 54 1 2 0 147 14 0 False False False 0 0
3 1271 119 5 7 4 276 33 2 True False False 0 0
4 1300 133 5 9 6 249 30 0 True False False 0 0
5 431 55 0 3 3 165 32 0 True False False 0 0
6 3163 78 0 15 3 201 51 0 False False False 0 0
7 5946 77 0 4 2 262 31 3 True False False 0 0
8 5783 81 0 2 2 552 37 0 False False False 0 1
9 6636 34 6 3 0 72 27 0 True False False 0 0
10 1718 92 5 11 2 200 36 0 True False False 0 0
11 5546 81 1 1 2 304 33 0 False False False 0 0
12 5623 89 1 2 6 264 30 0 False False False 0 0
13 3755 58 1 2 0 87 14 0 False False False 0 0
14 6821 47 1 7 0 106 32 0 True False False 0 0
15 6734 32 1 0 0 83 31 0 True False False 0 0
16 1168 88 1 5 3 201 39 0 True False False 0 0
17 1368 77 2 2 2 157 33 0 False False False 0 0
18 1625 103 5 11 1 281 24 0 False False False 0 0
19 5984 149 0 4 2 298 36 0 True False False 0 0
20 4827 94 1 3 2 262 36 0 True False False 0 0
21 801 71 1 2 3 157 31 0 False False False 0 0
22 2980 103 3 0 0 446 21 0 False False False 0 1
23 1726 91 3 5 1 192 33 6 True False False 0 0
24 1960 68 1 6 2 182 30 0 True False False 0 0

The confusion matrix displays the distribution of the classifications from the Naive Bayes model. Around 22% of the test data was spam misclassified as not spam (false negatives), whereas 73% was correctly predicted as not spam (true negatives).

In [58]:
cnf_matrix = confusion_matrix(y_test, y_pred)
class_names = [0, 1]  # class labels: 0 = not spam, 1 = spam
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[58]:
Text(0.5, 352.48, 'Predicted label')
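The percentages quoted above can be read directly from a normalized matrix (a short sketch added for illustration):

cm_pct = cnf_matrix / cnf_matrix.sum()  # each cell as a fraction of the whole test set
print(np.round(cm_pct, 2))  # the [1, 0] cell is the share of spam predicted as ham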

Case Conclusions

Based on the classification performance of Naive Bayes, we believe this model is an appropriate starting point for the dataset. Spam detection will never be 100% accurate, but we hope the company in question implements this model to detect spam in its employees' inboxes. By looking at the features important to the models, we can also use these attributes to train new employees to recognize what is potentially spam.

The top four most important features were from_dotedu, our cluster labels from K-Means, from_dotus, and from_dotcom, which suggests the domain from which an email is sent has a real impact on classification.

In [59]:
viz = FeatureImportances(clf, relative=False)
viz.fit(X_train, y_train)
Out[59]:
FeatureImportances(ax=<AxesSubplot:>, estimator=MultinomialNB(), relative=False)

Further analysis

Since the model is not 100% accurate (misclassifying roughly 22% of the test set as not spam when it was actually spam), we could display emails that were not classified as spam but had a high predicted probability of being spam with a soft warning. If the user then designates such an email as spam or not, we could use this feedback to improve the models.
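A minimal sketch of that soft-warning idea, assuming a hypothetical probability threshold of 0.30:

# Probability that each test email is spam (column 1 of predict_proba)
spam_proba = clf.predict_proba(X_test[X_train.columns])[:, 1]
soft_warn = (spam_proba > 0.30) & (y_pred == 0)  # hypothetical threshold
print(f"{soft_warn.sum()} emails would get a soft spam warning")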